variation distance
learning
Consider a news recommendation website that, when presented with a new user, sequentially offers a selection of currently trending articles. Such a system may only have a few opportunities to make recommendations before the user decides to navigate away, leaving little time to correct for misspecified or underspecified prior knowledge.
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
- Information Technology > Data Science > Data Mining > Big Data (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.45)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.45)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- (11 more...)
Learning from Convenience Samples: A Case Study on Fine-Tuning LLMs for Survey Non-response in the German Longitudinal Election Study
Holtdirk, Tobias, Assenmacher, Dennis, Bleier, Arnim, Wagner, Claudia
Survey researchers face two key challenges: the rising costs of probability samples and missing data (e.g., non-response or attrition), which can undermine inference and increase the use of convenience samples. Recent work explores using large language models (LLMs) to simulate respondents via persona-based prompts, often without labeled data. We study a more practical setting where partial survey responses exist: we fine-tune LLMs on available data to impute self-reported vote choice under both random and systematic non-response, using the German Longitudinal Election Study. We compare zero-shot prompting and supervised fine-tuning against tabular classifiers (e.g., CatBoost) and test how different convenience samples (e.g., students) used for fine-tuning affect generalization. Our results show that when data are missing completely at random, fine-tuned LLMs match tabular classifiers and outperform zero-shot approaches. When only biased convenience samples are available, fine-tuning small (3B to 8B) open-source LLMs can recover both individual-level predictions and population-level distributions more accurately than zero-shot and often better than tabular methods. This suggests fine-tuned LLMs offer a promising strategy for researchers working with non-probability samples or systematic missingness, and may enable new survey designs requiring only easily accessible subpopulations.
- Europe > Austria > Vienna (0.14)
- Europe > Germany > Thuringia (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)